Abstract

We present a C-language implementation of the lambda-pi calculus 
by extending the (call-by-need) stack machine of Ariola, Chang and 
Felleisen to hold types, using a typeless-tagless-final interpreter
strategy. It has the advantage of expressing all operations as folds
over terms, including by-need evaluation, recovery of the initial
syntax-tree encoding for any term, and eliminating most
garbage-collection tasks. These are made possible by a disciplined approach
to handling the spine of each term, along with a robust stack-based 
API. Type inference is not covered in this work, but also derives 
several advantages from the present stack transformation. Timing 
and maximum stack space usage results for executing benchmark 
problems are presented. We discuss how the design choices for this 
interpreter allow the language to be used as a high-level scripting 
language for automatic distributed parallel execution of common 
scientific computing workflows. 

1. Introduction 

Scientific computing workflows are commonly built at a high
level and require verifiability and reproducibility. Although this
is an ideal match for a functional programming style, almost all
current approaches are imperative. In practice, these quickly
become special-purpose and non-composable when dealing with
large-scale projects. Several frameworks have emerged to use
task-level dependency information to orchestrate data and work
distribution, including HT-CONDOR, Pegasus, and many
smaller projects such as REXEC and Fireworks. This also
includes middleware libraries such as ADIOS and StarPU.
Prominent examples of what has been accomplished for a few
specialized problems are the SETI@HOME and Folding@HOME
projects. Internally, many individual codes also use dependency
information to exploit fine-grained parallelism. For example, the
NAMD2 molecular dynamics code uses the Charm++ library,
and large-scale multiphysics PDE solvers are moving in this
direction as well.

The idea that computer science can provide useful high-level
abstractions for scientific computing is well-established. However,
applications of functional language constructs in high-performance
and distributed computing are just beginning to receive widespread
recognition. All three of the DARPA high-productivity grand challenge
languages, X10, Fortress, and Chapel, included some
language features for specifying programs using their mathematical
properties. New languages like Tupleware are also emerging
to facilitate distributed workflows.

All of these languages have been designed with user-directed 
parallelization in mind. That design strategy targets parallelization 
of large, single-purpose, monolithic codes. At the end of a run 
with such tools, all the data must be serialized, stored, and then 
re-arranged for use with the next analysis. For example, molecular
dynamics with NAMD2 can be run on multiple, slightly different
molecular systems to generate millions of large arrays for
later analysis. Later, those sets of large arrays might be subject to 
map-reduce type computations, as well as interactive analysis. The 
entire workflow may later need to be re-run with slightly different 
starting parameters or continued from the last run. These glue steps 
almost always require custom code, and consume a large portion of 
developer time in these workflows. 

Our vision of a high-level language for distributed storage and 
execution of parsed data requires a new virtual machine that
preserves term structure. The central objects in this workflow will not
be files, but functional syntax trees. Turning this picture into reality 
requires the ability to automate serialization of all data objects and 
make retrieving and working with remote data transparent. When 
required, single-purpose codes can be run as primitive operations, 
executing on-demand as part of a global virtual filesystem / code 
ecosystem. The evaluation method presented in this paper takes an 
important step by guaranteeing that every evaluation intermediate 
can always be serialized, and that evaluation can stop at any point. 
With this, fundamentally new applications are possible. 

We present details of an efficient, by-need evaluation strategy
on type-annotated terms that are capable of encoding the λΠ calculus.
The bulk of the paper is devoted to introducing the stack form
and proving its equivalence with the usual syntax-tree encoding
of lambda terms. Reduction and normal terms are then defined in
terms of the head-values in stack form. The equivalence between
forms is directly proven by our implementation, which provides
functions converting back and forth between initial and final
representations.

After detailing the term encoding and reduction machine, this
paper describes the fold structure based on the typeless-tagless
interpretation strategy, including all the details on garbage
collection, refcounting, and handling of primitives. Next, timings and
space usage for two simple benchmarks are provided and compared
to the Glasgow Haskell Compiler (GHC). The implementation
validates the claimed invariant properties of the stack form, and
shows that the stack size remains manageable even for large
problems. We end by discussing improvements that are possible, along
with code parallelization strategies.
Initial Term               Next Step   Stack Action

Apply N M                  N           pushAppl M
Lambda x_i : N → M         M           pushCtxt x_i : N = popAppl()
LetRec x_i : N = L in M    M           pushCtxt x_i : N = L
Ctor : N                   Done        Val = Ctor wind(N)
Dtor n                     Done        Val = Dtor x_i
Prim : N                   Done        Val = Prim wind(N)
Var n                      Done        Val = Var x_i
VarT n                     Done        Val = VarT x_i


Figure 2. Specification of the winding operation, which mutates a 
stack and terminates by setting the spine’s head-value. N, M are 
abstract syntax trees (the initial encoding) composed of the terms in 
the diagram. Variable translation from de-Bruijn indices to context 
pointers, x_i, occurs for the Var / VarT / Dtor rules by walking n
elements up the context for the current stack. Open binders occur 
when no pending applications are present (i.e. when popAppl() is 
unsuccessful). 
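
The winding rules of Figure 2 can be illustrated with a minimal C sketch. It is not the interpreter described in this paper: it assumes an untyped syntax tree restricted to Apply, Lambda, and Var, and the names Ast, Ctxt, Appl, Stack, and wind are hypothetical. Type annotations, LetRec, end pointers, and deallocation are all left out.

```c
#include <stdlib.h>

/* Hypothetical, untyped AST: only Apply, Lambda and Var (de Bruijn index n). */
enum Tag { APPLY, LAMBDA, VAR };
struct Ast { enum Tag tag; struct Ast *l, *r; int n; };

/* Context entry: a binder and its right-hand side (NULL = open binder). */
struct Ctxt { struct Ast *rhs; struct Ctxt *next; };
/* Pending application: an argument still waiting for a lambda. */
struct Appl { struct Ast *arg; struct Appl *next; };

struct Stack {
    struct Ctxt *c;     /* innermost binder (start of the context) */
    struct Appl *up;    /* innermost pending application */
    struct Ctxt *head;  /* head value: the binder a Var resolves to */
};

/* Wind term t onto stack s. Returns 0 on success, -1 on an unbound
 * variable or an allocation failure. */
int wind(struct Stack *s, struct Ast *t) {
    for (;;) {
        switch (t->tag) {
        case APPLY: {                 /* Apply N M: pushAppl M, continue with N */
            struct Appl *a = malloc(sizeof *a);
            if (a == NULL) return -1;
            a->arg = t->r; a->next = s->up; s->up = a;
            t = t->l;
            break;
        }
        case LAMBDA: {                /* Lambda: pushCtxt x = popAppl() */
            struct Ctxt *c = malloc(sizeof *c);
            if (c == NULL) return -1;
            c->rhs = NULL;            /* stays open if nothing is pending */
            if (s->up != NULL) {
                struct Appl *a = s->up;
                c->rhs = a->arg; s->up = a->next; free(a);
            }
            c->next = s->c; s->c = c;
            t = t->l;                 /* continue with the body */
            break;
        }
        case VAR: {                   /* Var n: walk n binders up the context */
            struct Ctxt *c = s->c;
            for (int i = 0; i < t->n && c != NULL; i++) c = c->next;
            if (c == NULL) return -1;
            s->head = c;
            return 0;
        }
        }
    }
}
```

Winding (λ.λ.1)(λ.0) from an empty stack, for instance, leaves one closed binder holding the argument, one open binder, and no pending applications; the head value points at the closed binder.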


Prior Work By-need reduction strategies have been extensively
investigated as fully lazy evaluation methods. More recently,
Chang and Felleisen provided a new characterization of normal
terms and proofs of uniqueness, correctness, standard reduction,
and observational equivalence. That work also introduced
the single reduction axiom, as well as a reduction machine. Here,
we fill in missing details in that picture by re-stating their reduction
axiom and proving invariants maintained by the stack
representation.

Part of the motivation for this work is to find a better evaluation
method for exploiting implicit parallelism. Parallel extensions to
Haskell have been in progress for some time. This goal is also
shared by recent innovations in Template and Cloud Haskell.
Effective run-time code distribution still remains a challenge for
both of those frameworks.


2. Stack Representation 

The interpreter uses a well-known transformation to turn every
lambda term into a single stack (Fig. 2). Each stack represents a
spine for a lambda term (see Fig. 3). In this form, the head-value
is trivially available as the stack bottom (variable reference,
constructor, etc.), the list of all reachable variable bindings is available
within the context, and, until this stack is to be substituted
somewhere else, all the head-value’s pending applications are visible as
well. Spines move to the left of applications and inside of lambdas,
and so include a list of bound variables, as well as leftover,
unmatched lambdas and applications. The critical information stored
by our evaluator for each spine is listed in Fig. 1.

The context is a series of type-annotated let-bindings ‘owned’
by the stack’s closure. They are used when needed by terms
at the bottom of the stack, and for determining which variables
represent function arguments. The unpaired (open) binders are
denoted by (X). They have type annotations, but no right-hand side.
There are no free variables in this representation. A special rule
prevents non-termination when dereferencing recursive bindings.
Variable references can point at either a term or a term’s type
annotation. Even though variable references are actually pointers to their
bound value, we carry the complete context to allow recovery of
the initial syntax-tree encoding.

Each context ends logically at the context where the stack
branched from a unique larger, enclosing term. These are shown
schematically by arrows in the example of Fig. 3. Together, the
end pointers have enough information to reconstruct the pairing of
@-λ-s in the initial, syntax-tree representation.
This work aims to manipulate λ terms entirely in stack form.
Accordingly, we have defined stacks in Fig. 1 so that the stack and
syntax-tree encodings are equivalent representations of the same
object. Our major contribution is proving that a set of invariants
guarantees this equivalence, and is preserved by a sensible reduction
strategy. We then leverage this representation to define reduced
terms based on the stack definition, rather than the universally
encountered initial definition of normal form. We will also describe
some other major advantages of working with the stack form.

Many transformations that appear very difficult in the initial
syntax-tree representation become almost trivial in stack form. One
of those advantages is the ability to make use of type annotations
present for each binder. In the λΠ calculus, terms and types inhabit
the same space. The syntax definition in Fig. 1 makes this explicit,
and at the same time loosens some of the conventional type-theory
restrictions. In particular, in λΠ, the type of functions from values,
x, of type A to values of type B(x) is (Πx : A.B(x)). The Π is a
binding construct, since B can be an arbitrary function of x. It is
different from λ only to denote that the whole term must evaluate to
a type-level quantity. However, if B(x) is another Π function type,
then this restriction just means that the value returned by B(x) is a
type. Therefore, Π terms can be curried in exactly the same way as
functions.

For simplicity, we choose to represent Πx : A.B(x) as just
λx : A → B(x). The only difference from λΠ is that a Π value
would have an earlier guarantee to be a type-level quantity. With
this perspective, Π is just a marking on a λ value that guarantees it
will eventually return a type. Finally, we include two special
constructors for closing the type hierarchy: a 0-argument hole
constructor, (? : N), to indicate an unknown term, and a 0-argument
type-universe, (* : *). In this way, type schemes can be translated
to terms by applying a function to an unknown value. For example,
the type of the identity function, ∀a. a → a, can be written
as (λa : * → λ_ : a → a)(? : *). Applying this type to a right-hand
side actually shows that it is the typeof function, taking an
argument to its type.

We note that to be completely correct, * should be parameterized
by a universe level, or else it is possible to run into a recursive
definition problem known as Girard’s paradox. This universe level
is also closely connected to subtyping, which must distinguish *
with unknown arity from, e.g., (Int : *_0). Although preliminary
results seem promising, this work will assume all terms are correctly,
statically typed, and will not address type inference issues.

With this shortcut, it also becomes trivial to evaluate the type of 
every term that already has properly annotated binders. The type of 
each term can be found by typing the bottom of a stack, and then 
dressing it with a new copy of all the binders in the stack’s context. 
For example, the type of the function, 

    λa : (?1 : *) → λb : (?2 : *) → a b

requires solving the unification problem,

    (?1 : *) = λ_ : (?2 : *) → (?3 : *)

and results in the term,

    let t : * = ?2 : *
        u : * = ?3 : *
    in λa : (λ_ : t → u) → λb : t → a b

The stack bottom’s type is, in turn, just the type of the head- 
value, after substitution of the head value for its type and evaluation 
Syntax

Stack  [S],[T],[U] ::=  let x_1 : [T] = [S] | (X)
                            ...
                            x_n : [T] = [S] | (X)
                        in Val [S_1] ... [S_m]
                        end ↑

Values Val ::=  Var x_i       Variable reference
                VarT x_i      Type reference
                Ctor : [S]    Base data value / constructor (Int, Cons, ?, etc.)
                Dtor x_i      Destructor
                Prim : [S]    Primitive


Bottom Evaluation Semantics

Var x_i       Done                             x_i = Open, or Var x_i is recursive and WHNF
Var x_i       appendStack [S] [U]              eval(x_i) = [U]
VarT x_i      appendStack [S] [U]              eval('x_i) = [U]
Ctor : [S]    Done
Dtor x_i      appendStack [S] [U]              eval(x_i) = Ctor [U]
Prim : [S]    Done                             WHNF† or insufficient args
Prim : [S]    appendStack [S_{k+1}, ...] [U]   δ P_k [S_1] ... [S_k] = [U]


Figure 1. Syntax, stack representation, and by-need reduction rules for a type-annotated λ-calculus. Sec. 3 details additional, essential
restrictions on the locations of open binders, (X), and the ordering of end-pointers for each context. †WHNF denotes that at least one open
binder is reachable from the stack’s context; hence the term is already in weak head-normal form. The special notation in δ-reduction for
primitives notes that if a primitive consumes k arguments, then those arguments are popped off of the pending applications when it returns.



let f = [let x = (X) in [x]]
    g = [let y = (X) in [y]]
    h = [let z = (X) in [z]]
in f [g] [h]





hand sides are immediately apparent, but the difficulty comes in 
accounting for the reference structure while evaluating the term. All 
of these complications are stuffed into the appendStack method. 

Stacks are mutable structures that rearrange during appendStack.
First, the stack to be appended (the appendee) is copied.
Next, the appendor’s head value is manually destroyed by calling
stack_dtor on its internal type annotations, if present. Finally, the
appendee is inserted in the head position between the last context
value and the next pending application for the appendor. Explicitly,


(λf. (λg. λh. h g f) λy. y)(λx. x) λz. z

Figure 3. Example reduction term, showing both initial and stack-based
encodings. Arrows on the left indicate pairing structure discovered
during winding. Arrows on the far right show the end-pointers
for the context of each bound right-hand side. Each separate
stack is delimited by []-s.

of the stack bottom. For our example, this procedure results in,

    let t : * = ?2 : *
        u : * = ?3 : *
    in λa : (λ_ : t → u) → λb : t → u.

Type-level holes are used for free type variables so that terms can
represent type schemes with the same ease as lambda. This simplification
should accordingly make unification problems much clearer.
The type annotations on open binders must be unified, followed by
unification of the remaining context of one stack with the bottom
of another. For convenience, we also allow variables to reference
the type annotations on terms as well as the terms themselves. This
may be too permissive as it leads to multiple equivalent encodings,
with a type-level hole present in the annotation or as a bound
right-hand side.

Example Figure 3 shows the stack winding transformation of
an example from Chang and Felleisen. The application right-


appendStack([S], [U]) = gc_val([S]); wind([S], get_ast([U]))


get_ast retrieves the initial encoding of [U] without modifying [U].
The wind step traverses that syntax tree, pairing S’s contexts when
it finds an open λ and pushing its pending applications in front of
the appendor’s.

In practice, get_ast uses automatically garbage-collected heap
space which is immediately orphaned, and so we have implemented
appendStack as a direct copy instead. For a direct copy, only the
variable references (to contexts) internal to the copied stack need
to be renumbered. This is because variable references are direct
pointers instead of de-Bruijn indices. The appendee’s entire context
extends the appendor’s. Winding would have paired [U]’s closed
binders internally in exactly the same way, while its open lambdas
bind the appendor’s pending applications. The appendee’s remaining
pending applications go in front of the appendor’s.

In our implementation, stack deallocation (stack_dtor),
construction of the initial encoding (get_ast), and even by-need
evaluation (need), are implemented using folds over the stack structure.
These folds are described in more detail in the next section.


3. Properties 

The context provides a ‘handle’ for traversing the stack. This is 
shown by noting that the stack maintains all of the following
invariants:
1. All sub-stacks directly referenced from a stack (annotations,
bound right-hand sides and pending applications) end at a context
within the stack or at the stack’s end.

2. All variables referenced from the stack are reachable on the path
from the stack’s context pointer (but may be past the stack’s
end).

3. A sub-stack of the head value (type annotation or even a nested
data structure not shown in Fig. 1) always ends at the start of
the head value’s context.

4. The ending points of bound right-hand sides in a given stack
show the pairing structure of lambda-s and applications in the
following way. If, when traversing the context from start to
end, a right-hand side’s end pointer is another context value in
the stack, no end pointers (of right-hand sides) on intervening
contexts can go beyond that context value. Letrec-s are easily
identified because their context ends at their own binder.

5. Pending application contexts end at ordered locations, with 
the innermost, first, application closest to the start (having the 
largest context), and the outermost, last, application closest to 
the end (smallest context). 

6. No pending applications have end points between paired binders. 

7. All open binders in a stack are reachable from all pending 
applications for that stack. 

The first means that each referenced spine terminates at some point
of the main spine’s context. These termination points are what
hold the stacks together as a coherent tree, since each stack represents an
entire self-contained sub-tree. Nevertheless, the number of stacks
is exactly the number of nodes in an initial syntax-tree encoding
that are neither lambda nor apply nodes, while the number of contexts is
exactly the number of lambda terms.

The ordering properties are a consequence of the fact that 
each spine represents a transformed syntax tree. Application of 
a lambda turns into a paired, closed, binding on the context, and so 
open binders have to come before unpaired, pending applications. 
Similarly, a pending application cannot have a context ending in
between paired apply-lambda-s. Pending applications are ordered
from outermost to innermost. 

We prove that as long as these invariants are satisfied, there is 
always a transformation from the stack form back to a syntax tree. 
Conversely, the translation from the syntax tree to the stack form is 
trivial. Because of this duality, the two forms are equivalent.
Operations can then be specified on whichever form is more convenient.

We then give an evaluation algorithm that simultaneously
computes both terms and types. We prove this algorithm maintains the
above invariants. Normalization is defined implicitly via reducible 
and irreducible head-values. This leads to a computationally useful 
definition, since every stack has a unique head-value, and since the 
reduction proceeds by case analysis on these values. 

The invariants ensure that de-Bruijn indices can always be constructed
by counting the number of steps to a reference in the context.*
We now prove two theorems that establish the 1:1 mapping 
between conventional lambda terms and stacks, and a third that 
justifies defining normal forms in terms of a stack form. 

3.1 Winding produces valid stacks 


First, we assume that winding proceeds from an otherwise valid 
stack. Since winding terminates by setting a bottom value, the 
starting bottom value can be undefined. We can always start at a 
stack with no contexts or pending applications. 


* As long as all variables are bound. 


Since the winding process always proceeds along the left
branch, we can consider the set of λ-s and @-s encountered as a series
of tokens with unknown order. We need to show that any series
produces a valid stack. The end-context of an application is the
nearest λ to its left, while the applications always pair with the
nearest λ to their right. Unmatched binders become open contexts,
while unmatched applications become pending applications.

Property 1 is simple to show, since the context can only grow 
during winding, and originally starts at the stack’s end point. The 
context at every point during winding contains all binding λ-s in the
tree’s path to the root. Violating property 2 would therefore mean 
that an unbound variable had been found. Similarly, property 3 is 
automatically satisfied by all the winding termination rules, since 
the stack’s context is fixed when winding terminates. 

Properties 6-7 are also simple, since violating (6) would indicate
a mis-pairing of λ and @, and since an unmatched binder cannot
occur after any unmatched application. Properties 4 and 5 can
be proven by noting that the matching process is equivalent to matching
parentheses. Each @ acts as a left parenthesis, and each λ as
a right. When traversing the stack from start to end, it is like reading
a parenthesized expression backwards. The ending points of the
bound right-hand sides are everything left of their matched parentheses.
Property 4 follows trivially from this observation. Similarly,
there may exist interspersed, unmatched λ-s and @-s. These are
always ordered.
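
The parenthesis analogy can be checked mechanically. The sketch below is an illustration only, not part of the described implementation: it assumes the spine is given as a string of tokens in winding order (root first), with '@' standing for an application and 'L' for a λ, and the function name match_spine is hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Pair each 'L' (lambda) with the most recent unmatched '@' (application),
 * reading tokens in winding order. partner[i] receives the index of the
 * matching token, or -1 for an open binder / pending application.
 * Returns the number of matched pairs, or -1 on a bad token or on
 * allocation failure. partner must have room for strlen(tok) entries. */
int match_spine(const char *tok, int *partner) {
    size_t n = strlen(tok), top = 0;
    int pairs = 0;
    size_t *pending = malloc((n ? n : 1) * sizeof *pending);
    if (pending == NULL) return -1;

    for (size_t i = 0; i < n; i++) {
        partner[i] = -1;
        if (tok[i] == '@') {
            pending[top++] = i;            /* left parenthesis: push */
        } else if (tok[i] == 'L') {
            if (top > 0) {                 /* right parenthesis: pop and pair */
                size_t j = pending[--top];
                partner[i] = (int)j;
                partner[j] = (int)i;
                pairs++;
            }                              /* else: open binder, stays -1 */
        } else {
            free(pending);
            return -1;
        }
    }
    free(pending);                         /* leftovers are pending applications */
    return pairs;
}
```

Because an 'L' is left unmatched only when the pending stack is empty, every open binder precedes every pending application in winding order, which is the ordering the proof relies on.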

3.2 Valid stacks can always be unwound to initial terms 

We prove this by giving the unwind algorithm in Fig. 4. It proceeds
in two phases. First, the stack bottom is transformed into its initial 
encoding. For stacks without annotations this is trivial. For stacks 
with annotations, the annotation must be unwound first. Variables 
are simple to resolve to de-Bruijn notation by walking up the 
context (property 2). 
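
Resolving a variable pointer back to a de-Bruijn index amounts to counting steps along the context chain. A minimal sketch follows, assuming a context is a singly linked list whose `next` field points toward the enclosing binders; the struct layout and the name debruijn_index are illustrative assumptions rather than the implementation's own definitions.

```c
#include <stddef.h>

/* Assumed minimal context node: only the link toward the enclosing binder. */
struct Ctxt { struct Ctxt *next; };

/* Number of steps from `start` (the stack's context pointer) to `target`
 * (the binder a variable points at). The walk deliberately continues past
 * any stack end, since property 2 allows references beyond it.
 * Returns -1 if the binder is unreachable, i.e. the variable is unbound. */
int debruijn_index(const struct Ctxt *start, const struct Ctxt *target) {
    int steps = 0;
    for (const struct Ctxt *c = start; c != NULL; c = c->next, steps++) {
        if (c == target) return steps;
    }
    return -1;
}
```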

Second, the bottom is wrapped in a series of lambda/apply-s by 
visiting the context from start to end (end is not visited). Initially, 
the contexts of all pending applications are copied, in the same 
order, onto an interrupt stack. Unwinding stops at each interrupt 
to wrap the term in an Apply (unwinding the pending application 
right-hand side). At each binder, the term is wrapped in a Lambda 
tree node (or a LetRec if the right-hand side context ends at the 
current binder). Using the analogy from the previous section, each
binder in the context is a right parenthesis. The context of its
corresponding left parenthesis is pushed onto the interrupt stack
and unwinding proceeds to it. No pending application can interrupt 
this process (by property 6). At this point, the term is wrapped in 
an Apply node. This must match the corresponding Lambda, since 
all other Lambda-s will have been matched using this procedure, 
by property 4. 

Interrupts are guaranteed to be encountered during unwinding 
by property 1. If unwinding encounters an open binder, no interrupt 
is pushed. The interrupt stack will already be clear, by property 7. 

We also observe here that each paired application can be
destroyed without spoiling the stack property. By analogy to the
parentheses matching, none of the ordering properties are altered.
However, this involves checking the end-pointers of every stack
reachable from the current one that ends at the collected binding,
replacing it with the next binder in the series. This can cause a
considerable slow-down and should be re-examined in the future. Our
implementation maintains reference counts for each binder to
identify and remove unused binders immediately.

3.3 Evaluation does not break validity 

Since evaluation makes use of essentially only one rule, this is easy to
show. First, note that appendStack can be written in terms of wind
and unwind (see the definition of appendStack in Sec. 2). Destroying
the stack bottom is inconsequential, and results in a valid starting
stack for wind. Winding the result of an evaluation may encounter any
variables reachable from that evaluation. Since only right-hand sides
are substituted, those variable references fall in one of two sets.
Either the variable is internal to the right-hand side (below the
right-hand side’s end pointer), or it is external (at or above end).
If internal, that reference will link to a binder that will be added
during wind. If external, the binder lies at or above the end pointer,
which by property 1 is reachable from the current stack’s context.
Therefore, the precondition for wind to produce a valid stack has
been met.

Evaluation can trivially stop at any point and still produce a 
valid partial evaluation. 

Next, we should consider the special impact of the primitive
reduction rule. In our implementation, primitives use an API for
manipulating the first k terms in the pending application stack (where
k depends on the primitive). These are evaluated by-need when the
primitive requests their value. The API makes it impossible to
insert binders or change context values (other than normal
evaluation). The primitive completes by leaving a return value on the
application stack, and destroying the others. This last remaining value
forms the new stack bottom. To show this is valid, note that none
of the ordering properties are violated if any k − 1 of the first k
pending applications are removed, since no re-ordering is possible.
Also, the move of the remaining stack to the bottom also uses
appendStack.
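
The shape of such a primitive can be sketched as follows. This is a deliberately simplified model, not the API of the implementation: it assumes the pending applications are already-evaluated integers held in an array (rather than lazily evaluated stacks), and the names Pending and prim_add are hypothetical.

```c
#include <stddef.h>

/* Assumed model of the pending-application stack: args[n-1] is the
 * innermost (first) pending application, args[0] the outermost. */
struct Pending { long *args; size_t n; };

/* A primitive with k = 2: consumes the first two pending applications and
 * leaves a single return value in their place. Returns 0 without touching
 * the stack when fewer than k arguments are pending (the "insufficient
 * args" case, where the term is already a value), and 1 otherwise. */
int prim_add(struct Pending *p) {
    if (p->n < 2) return 0;
    long a = p->args[p->n - 1];   /* first argument  */
    long b = p->args[p->n - 2];   /* second argument */
    p->n -= 1;                    /* k - 1 applications are destroyed */
    p->args[p->n - 1] = a + b;    /* the survivor holds the return value */
    return 1;
}
```

Applications beyond the first k are never read or moved, so their relative order is unchanged, mirroring the argument that removing k − 1 of the first k pending applications cannot violate the ordering properties.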

If more generality is needed, the k pending applications could 
all be re-linked to end at the start of the stack’s context and a 
general lambda-term could be wound into the head value. The 
additional invariants have been very useful for developing and 
testing functions that work with stacks, including fast copies and 
the primitive API. 

4. Unwinding Functions 

Dealing with in-memory references is notoriously difficult in low- 
level languages. This section presents wind and unwind routines 
that standardize computations, eliminating most common errors. 
The major workhorse is the unwind function, which essentially 
works as the foldr of f, starting from val. In higher-level languages,
it is possible to achieve a much more precise manipulation
of the function, including composition, partial evaluation, and inlining.
These three together produce the tagless property of tagless
interpreters. 

Underneath the hood, each member of a deconstructor typeclass
has to be looked up from a dictionary and replaced with the appropriate
function during compilation of the evaluator. Here, we make
that choice explicit by providing different definitions of val, let_l,
let_a, and apply for each type of fold.
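
The idea of selecting a fold by swapping a table of function pointers can be shown on a much simpler structure. The sketch below assumes a plain linked list instead of the stack structure, and a two-entry table instead of the four-entry SFold; Node, Fold, fold_nodes, count_fold, and sum_fold are hypothetical names chosen for illustration.

```c
#include <stddef.h>

struct Node { long payload; struct Node *next; };

/* Explicit "dictionary" of callbacks, playing the role of a typeclass
 * instance: one entry seeds the accumulator, one handles each node. */
struct Fold {
    void *(*val)(void *acc);
    void *(*step)(const struct Node *n, void *acc);
};

/* Generic traversal from n up to (not including) end. The traversal
 * itself never changes; only the table passed in does. */
void *fold_nodes(const struct Node *n, const struct Node *end,
                 const struct Fold *f, void *acc) {
    acc = f->val(acc);
    for (; n != end; n = n->next) acc = f->step(n, acc);
    return acc;
}

static void *zero(void *acc) { *(long *)acc = 0; return acc; }
static void *count_step(const struct Node *n, void *acc) {
    (void)n;                       /* the count ignores the payload */
    ++*(long *)acc;
    return acc;
}
static void *sum_step(const struct Node *n, void *acc) {
    *(long *)acc += n->payload;
    return acc;
}

/* Two folds over the same structure, differing only in their tables. */
const struct Fold count_fold = { zero, count_step };
const struct Fold sum_fold   = { zero, sum_step };
```

Calling fold_nodes with count_fold or sum_fold walks the same list with the same loop; the choice of table is the only thing that distinguishes counting from summing.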

Recovering Initial Encoding (get_ast) The visitor structure in
our implementation is specially designed for this task. It can produce
arbitrary partial terms and context errors by recognizing the
end pointer of its starting stack and jumping in number to a root 
context, which is visible from another arbitrary place in the term. 
In pseudocode, 

• val (StackVar cp) = val ← Var (steps from s->c to cp)

• let_l x t = val ← Lambda x : get_ast(t) → val

• let_a b = val ← Apply val get_ast(b)

• apply b = val ← Apply val get_ast(b)

A major source of early difficulties was attempting to both evaluate 
and produce a syntax tree at once. 

Call-By-Need Call-by-need is focused on evaluating the stack’s 
head-value, but also has the ability to clean up the context or 
evaluate pending applications after it is completed. How to evaluate 


/* Program, State and Ctxt are defined elsewhere in the interpreter. */
struct Program;
struct State;
struct Ctxt;

struct SFold {
    void *(*val)(struct Program *p, struct State *s, void *val);
    void *(*let_l)(struct Program *p, struct Ctxt *c, void *val);
    void *(*let_a)(struct Program *p, void *val, struct State *b);
    void *(*apply)(struct Program *p, void *a, struct State *b);
};

/* Forward declaration: unwind() calls unwind_context() before its definition. */
void *unwind_context(struct Program *p, struct Ctxt **cp,
                     struct Ctxt *end, struct SFold *f, void *val);

void *unwind(struct Program *p, struct State *s,
             struct SFold *f, void *val) {
    struct State *next, *sup;
    struct Ctxt *c;
    void *ret;

    /* Repeat the head-value step until f->val() returns non-NULL. */
    while( (ret = f->val(p, s, val)) == NULL);
    val = ret;   // Caution! f->val() may deref Var-s,
    sup = s->up; // and change 's->c, s->up'
    c = s->c;

    while(sup != NULL) {
        next = sup->parent;
        val = unwind_context(p, &c, sup->end, f, val);
        val = f->apply(p, val, sup); // ascend application
        sup = next;
    }

    return unwind_context(p, &c, s->end, f, val);
}

void *unwind_context(struct Program *p, struct Ctxt **cp,
                     struct Ctxt *end, struct SFold *f, void *val) {
    struct State *sb;
    struct Ctxt *next, *bend = NULL, *c = *cp;

    while(c != end) {
        sb = c->b; // just in case c is freed (e.g. by let_l)
        next = c->next;
        if(sb != NULL)
            bend = sb->end; // or end is modified...
        val = f->let_l(p, c, val);

        if(sb != NULL && bend != c) { // unwind sub-contexts
            val = unwind_context(p, &next, bend, f, val);
            val = f->let_a(p, val, sb);
        }
        c = next;
    }

    *cp = c;
    return val;
}

Figure 4. Stack-unwinding algorithm, expressing a generic fold. 

This is the heart of the interpreter. 


pending applications is one major point of contention in existing
applications. If the stack is to be substituted, or if the head value
is strict, evaluation needs to be done. However, this work can also 
be delayed by creating new bindings on the context. This requires 
a generalized lifting operation. 

• val: Implement Fig. 1, returning NULL after appendStack and
returning the current stack on Done.

• let_l (Ctxt c) | nref c == 0 = stack_dtor c.t; stack_dtor c.b; unlink
c

• let_l (Ctxt c) | otherwise = need (c.t)

• let_a b = pass

• apply b = need b




2015 / 9/24 



XX has clearly described how call-by-need can be implemented 
in terms of folds over this structure. Whenever a head-value is a 
variable reference, the binding for that variable is evaluated by-need,
and the result is copied and substituted for the variable.
Because all variable references are pointers in this scheme, no index 
re-numbering is required during the copy. 

Substitution can often be optimized to a direct move of the
stack. This is possible when the variable is the only reference to
its binder. We maintain reference counts by incrementing a context’s
count whenever a variable is added during wind, and decrementing
reference counts whenever a variable is destroyed. Recursive
references cannot be optimized in this way, since substitution
might lead to more references. Recursive references are only
counted for references outside the stack. Cyclic references require
a special constructor (we use record types). Their counts are maintained
by saving the reference count before and restoring it after
their right-hand side is evaluated.
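
The move-versus-copy decision described above can be written down in a few lines. The sketch assumes a binder that carries only a reference count and a flag marking recursive bindings; Binder, ref_add, ref_drop, and choose_subst are hypothetical names, and the special handling of cyclic record types is not modelled.

```c
/* Assumed minimal binder: a reference count plus a recursion marker. */
struct Binder { int nref; int recursive; };

enum Subst { SUBST_MOVE, SUBST_COPY };

/* Called when wind adds a variable that points at b. */
void ref_add(struct Binder *b) { b->nref++; }

/* Called when a variable pointing at b is destroyed.
 * Returns 1 when b has become unused and may be unlinked. */
int ref_drop(struct Binder *b) { return --b->nref == 0; }

/* A right-hand side may be moved rather than copied only when the variable
 * being substituted is the sole reference and the binding is not recursive. */
enum Subst choose_subst(const struct Binder *b) {
    return (b->nref == 1 && !b->recursive) ? SUBST_MOVE : SUBST_COPY;
}
```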

As shown by the use of stack_dtor, by-need evaluation proceeds
in tandem with stack destruction. Further traversal steps continue
up the tree, visiting all the binders in the context. After the stack
bottom has been evaluated, all the needed right-hand sides should
have been substituted. Hence, all matched let-bindings could be
destroyed, leaving only the open let-bindings. A refinement used in
this work is to keep all bindings that have any references remaining.
This way, exactly as much garbage collection is done as needed,
so partial evaluation still results in a valid term. The correctness
of this procedure relies critically on the fact that the tree is traversed
from start to end. Because of this, each let-binding is visited
immediately after all possible references to it have already been visited.

Garbage Collection (stack_dtor) Destroying the stack can be 
done by de-allocating all its internal data structures. It only makes 
sense to do this with a visiting fold. We have taken special care to 
read the next few steps from the stack during the unfolding process 
so that this works. Since stacks maintain a strict tree ordering, 
automatic garbage-collection is not needed (or used in our 
implementation) for stack structures. 
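
The read-ahead discipline can be illustrated with a simplified destructor. The sketch below assumes a stack is a plain singly linked list of cells; the struct stack layout and the names stack_make and stack_dtor_sketch are hypothetical and much simpler than the real stack structure. 

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stack cell: a singly linked list of frames. */
struct stack {
    struct stack *next;
    int payload;
};

/* Visiting fold that frees every cell. The successor is read before
 * the current cell is released, so the traversal never touches freed
 * memory. Returns the number of cells destroyed. */
static size_t stack_dtor_sketch(struct stack *s)
{
    size_t freed = 0;

    while (s != NULL) {
        struct stack *next = s->next;   /* read ahead first */
        free(s);
        s = next;
        freed++;
    }
    return freed;
}

/* Build a stack of n cells with payloads 0..n-1, last pushed on top.
 * Returns NULL for n == 0 or on allocation failure. */
static struct stack *stack_make(int n)
{
    struct stack *top = NULL;

    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        struct stack *s = malloc(sizeof *s);
        if (s == NULL) {
            stack_dtor_sketch(top);     /* undo the partial build */
            return NULL;
        }
        s->next = top;
        s->payload = i;
        top = s;
    }
    return top;
}
```

Because each cell has exactly one owner, a single pass releases everything and no collector is required. 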

5. Benchmarks 

We give timings and heap usage below for the n-queens and tak 
benchmarks on an Intel 2.5 GHz Core i7 running OSX, for an 
interpreter compiled with clang-602.0.53. Each test was checked for 
reproducibility and status was tabulated at each context creation. 
Timings were reported for the non-heap reporting versions. 

The 8-queens program runs in 6.9 seconds and makes 1.8 million 
context allocations, with a high-water mark of 2029 contexts 
and 7856 stacks. The 10-queens program runs in 221 s and makes 
45 million context allocations, with a high-water mark of 2778 
contexts and 10789 stacks (Fig. 5). The program utilized the deforested 
version of lists, with type 

List(a : ★) = (λb : ★ → (_ : a → _ : b → b) → _ : b → b)(? : ★) 

introduced by Gill, Launchbury, and Peyton-Jones. Interestingly, 
that strategy is an incipient version of the tagless method, 
as it relies on replacing the list construction operators, cons and 
nil, with deconstruction functions. 
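
To make the idea concrete, the following sketch represents a list of integers by the way it is consumed rather than by tagged cells. It is an illustration only: C has no closures, so each cell carries its captured head and tail explicitly, and the names struct list, fold_nil, fold_cons, add, and count are assumptions of this example. 

```c
#include <assert.h>
#include <stddef.h>

typedef long (*cons_fn)(long head, long folded_tail);

/* A list is represented by its fold: `fold` replaces every cons with
 * the caller's `cons` function and nil with the caller's `nil` value.
 * No constructor tag is stored or inspected. */
struct list {
    long (*fold)(const struct list *self, cons_fn cons, long nil);
    long head;                  /* unused by the empty list */
    const struct list *tail;    /* unused by the empty list */
};

/* Deconstruction function standing in for the nil constructor. */
static long fold_nil(const struct list *self, cons_fn cons, long nil)
{
    (void)self;
    (void)cons;
    return nil;
}

/* Deconstruction function standing in for the cons constructor. */
static long fold_cons(const struct list *self, cons_fn cons, long nil)
{
    assert(self->tail != NULL);
    return cons(self->head, self->tail->fold(self->tail, cons, nil));
}

/* Two consumers: summation and length. */
static long add(long head, long acc)   { return head + acc; }
static long count(long head, long acc) { (void)head; return 1 + acc; }

/* The list [1, 2, 3], built from statically allocated cells. */
static const struct list l_nil = { fold_nil, 0, NULL };
static const struct list l3 = { fold_cons, 3, &l_nil };
static const struct list l2 = { fold_cons, 2, &l3 };
static const struct list l1 = { fold_cons, 1, &l2 };
```

Calling l1.fold(&l1, add, 0) yields 6 and l1.fold(&l1, count, 0) yields 3, without any consumer testing whether a cell is a cons or a nil. 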

The tak benchmark is a stringent test of recursive calls. When 
run with our interpreter on a medium problem size (27 16 8), 
it makes 26.9 million context allocations. Nevertheless, its high- 
water mark was just 162 contexts and 1245 stacks (Fig. 5). The 
benchmark ran in 119 s without any memoization. 

In contrast, the Glasgow Haskell Compiler (GHC) uses 
highly optimized array STM and pinned memory to solve the 
10-queens problem in just over 0.01 s, and tak in 0.09 s. For both 
problems, its heap-space usage remains essentially flat at 36 kbytes 
throughout the run. It should be noted that GHC has much slower 
performance for these problems in interpreted execution mode, 
requiring 1.4 s to run 10-queens and 4.9 s to run tak. 

Figure 5. Both benchmark programs are executed by our 
interpreter in constant stack-space, as shown by contexts vs. run-time. 
In both cases, the number of stacks (and hence total running size) 
is proportional to the number of contexts. Corresponding sizes are 
given in the text: 738 kbytes for 10-queens and 77 kbytes for tak. 

Our linked-list implementation of stacks and contexts requires 
48 bytes per context and 56 bytes per stack. This makes the high- 
water marks for 10-queens at 738 kbytes and tak at 77 kbytes. 
These compare favorably with GHC’s heap-space usage. 
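
The sizes quoted above follow directly from the per-object costs and the high-water marks. The helper below reproduces the arithmetic; the function and constant names are ours, chosen for illustration. 

```c
#include <assert.h>

/* Per-object sizes stated in the text for the linked-list layout. */
enum { CTXT_BYTES = 48, STACK_BYTES = 56 };

/* Memory in bytes held at a high-water mark of `contexts` contexts
 * and `stacks` stacks. */
static long high_water_bytes(long contexts, long stacks)
{
    assert(contexts >= 0 && stacks >= 0);
    return contexts * CTXT_BYTES + stacks * STACK_BYTES;
}
```

For 10-queens, high_water_bytes(2778, 10789) gives 737528 bytes (about 738 kbytes), and for tak, high_water_bytes(162, 1245) gives 77496 bytes (about 77 kbytes). 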

6. Discussion 

The benchmark problems above clearly show that the interpretation 
strategy taken here is robust. Development versions of our 
interpreter have successfully checked all the invariants listed in 
Sec. ?? at every major step of code interpretation. Although the 
run-times above are slightly disappointing, it is highly encouraging 
that the stack space usage parallels that used by GHC itself. It is 
notable that most naive functional language implementations face 
great difficulty with these programs. 

This is an excellent starting point for more advanced evaluators 
for the generic lambda-calculus with dependent types. Several 
improvements can likely be made to increase the execution speed, 
and to finally realize the automatic task-level parallelism that 
λ-calculus has the potential to provide. 

The actual implementation recognizes Cons, Tup, and Commit 
(for a directory of named bindings) as constructors, provides 
several base data types, and provides a means for creating 
generalized algebraic datatypes as Sym constructors. It implements 
IO by recognizing stacks with IO head-values as constructors 
during functional evaluation, and as primitives during imperative 
reduction. IO primitives return an extra flag indicating 
whether evaluation has completed with returnIO. The ST monad is 
implemented in a similar way, but can be invoked from within a 
functional computation by the runST primitive, which creates the 
initial copy and manages the state. It also provides a unique 
mechanism for creating primitives and new packed binary data values 
from specially formatted Commit types. 

Elimination of most relinking steps As mentioned in the text 
describing the stack_dtor fold, a major source of inefficiency in the 
present work is the maintenance of end pointers on each context. 
Since evaluation proceeds in tandem with context deletion, the 
stack must be constantly scanned to rewrite these pointers. If, 
instead, the end pointer of every right-hand side were contained as 
part of its context, then each stack could have a NULL end-pointer. 
This would eliminate recursive re-writes, but require some other 
mechanism for creating de-Bruijn indices pointing outside the 
current stack during get_ast. One simple resolution is to track parent 
spines for each spine. Alternatively, the parent spine of each 
context could be used. The spine tree structure could be re-built by 
examining all Var pointers in each stack and creating a topological 
sort. 

Unboxing and Inlining fold functions The usual machine 
implementation of the lambda-calculus uses continuation passing to 
jump back after evaluating an application right-hand side. The 
unwind method is slightly more complicated, maintaining a state of 
paired binders. Nevertheless, it could be made into a closure and 
returned into in the same way. 

Once this step has been taken, each fold function over stacks 
can be inlined by turning all stacks into executable code. For 
example, the code for var ← Apply(var, get_ast(b)) would be 
push_closure(Apply(var, ·), closure) followed by call b. For 
arguments that have statically known, machine-representable types, 
both the producer and consumer can create code that makes 
use of unboxed data values, that is, final form. 
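
The push-then-call pattern can be sketched as follows. This is a toy model under stated assumptions: syntax trees are rendered as strings, the pending closures live in a fixed-size array, and every name apart from push_closure (struct closure, return_into, resume_apply, call_b) is hypothetical. 

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_PENDING 16

/* Hypothetical closure: resumption code plus its captured variable. */
struct closure {
    void (*resume)(const struct closure *self, const char *arg,
                   char *out, size_t n);
    const char *var;    /* head of the pending Apply(var, .) */
};

static struct closure pending[MAX_PENDING];
static int npending = 0;

static void push_closure(struct closure k)
{
    assert(npending < MAX_PENDING);
    pending[npending++] = k;
}

/* Return `value` into the most recently pushed closure. */
static void return_into(const char *value, char *out, size_t n)
{
    assert(npending > 0);
    struct closure k = pending[--npending];
    k.resume(&k, value, out, n);
}

/* Resumption code for Apply(var, .): fills the hole with `arg`. */
static void resume_apply(const struct closure *self, const char *arg,
                         char *out, size_t n)
{
    assert(strlen(self->var) + strlen(arg) + sizeof "Apply(, )" <= n);
    snprintf(out, n, "Apply(%s, %s)", self->var, arg);
}

/* Stands in for the compiled code of stack b: it produces its own
 * syntax tree and returns it to whichever closure is waiting. */
static void call_b(char *out, size_t n)
{
    return_into("b", out, n);
}
```

Pushing a closure with var set to "f" and then invoking call_b writes "Apply(f, b)" into the output buffer and leaves no closures pending. 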

Enhanced Type System Development As discussed in Sec. ??, 
terms and types are syntactically very close. This makes unification 
much simpler, since unifying two terms in stack form just requires 
unifying the types of their open binders, and then unifying the 
bottom of the term with the smaller context with the remaining stack 
on the other. It is the unification solutions which make this process 
difficult. Each solution eventually gets written into a ? 
constructor. However, that solution must be lifted to a point in the 
context where it is reachable from both halves of the unification 
problem. If the solution references variables, those must also be 
lifted. This reachability requirement translates to an occurs-check 
so that the solution cannot be recursive. It also disallows open 
binders (or stacks referencing open binders) from being lifted above 
the binder. This imposes a very helpful restriction on type 
annotations, namely that unification cannot automatically create 
types that depend on future, unknown variables. This requires a 
non-trivial edit in stack form which will be discussed in future work. 

Distributed Parallel Evaluation The interpreter built in this 
work can serialize syntax trees in initial form to a Google 
protocol buffer format, and store them in a distributed hash table. The 
resulting hash table manages objects in much the same way as a git 
source-code repository. This automatic serialization of every term 
is made possible by the tagless-inspired strategy of this work. From 
this point, trivial parallelization of arbitrary code can be facilitated 
using a bag of tasks model. 

We note that a distributed, lazy, work-stealing scheduling 
strategy is also possible. If a term has a strict head-value, all its 
pending application stacks will need to be evaluated. By properties 
1 and 2, all of the references from any of these applications are 
reachable from the context of the current stack. Therefore, only the 
bound right-hand sides need to be made available in case they are 
needed by a peer evaluating one of those pending applications. Once 
placed in the table, the context can be replaced with a hash 
reference. 

The work list starts out when each strict, pending application is 
made available as code and placed in an active queue. In general, 
peers can start by dequeuing any object from the active queue, 
and activating any right-hand sides as needed. Peers remain lazy, 
evaluating references only by-need. On completion, the evaluated 
term is stored in a completed work list, along with a mapping from 
the hash of the original to the hash of the evaluated term. Generally, 
peers may cause contention by competing to dequeue common 
references. Nevertheless, re-evaluation does not cause harm, and 
this organization structure gives an implementable distributed lazy 
evaluator. 
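
The reason duplicate evaluation is harmless can be seen in a small model of the completed work list. The sketch below is an assumption-laden stand-in: it uses the FNV-1a hash in place of whatever content hash the distributed table uses, a fixed-size in-memory array in place of the table itself, and hypothetical names (term_hash, done_put, done_get). 

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 64

/* FNV-1a, standing in for the content hash of a serialized term. */
static uint64_t term_hash(const char *s)
{
    uint64_t h = 0xcbf29ce484222325ULL;

    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Completed work list: hash of the original term mapped to the hash
 * of its evaluated form, stored with linear probing. */
struct done_entry {
    uint64_t original, evaluated;
    int used;
};
static struct done_entry done[TABLE_SIZE];

/* Record original -> evaluated. Returns 1 for a new entry and 0 when
 * another peer already recorded the same result. */
static int done_put(uint64_t original, uint64_t evaluated)
{
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        struct done_entry *e = &done[(original + i) % TABLE_SIZE];
        if (!e->used) {
            e->original = original;
            e->evaluated = evaluated;
            e->used = 1;
            return 1;
        }
        if (e->original == original) {
            /* Evaluation is deterministic, so results must agree. */
            assert(e->evaluated == evaluated);
            return 0;
        }
    }
    assert(!"completed work list is full");
    return 0;
}

/* Look up a finished term. Returns 1 and sets *evaluated if found. */
static int done_get(uint64_t original, uint64_t *evaluated)
{
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        const struct done_entry *e = &done[(original + i) % TABLE_SIZE];
        if (!e->used)
            return 0;
        if (e->original == original) {
            *evaluated = e->evaluated;
            return 1;
        }
    }
    return 0;
}
```

If two peers race on the same term, the second done_put finds the identical mapping already present and returns 0, so the only cost of contention is the repeated work. 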